MORPHOLOGICAL DISAMBIGUATION

FIELD OF THE INVENTION 

The present invention relates generally to
computer-based linguistic processing and specifically to
methods for resolving which of a number of meanings of a
given word in a text is likely to be the correct one,
particularly in morphologically-rich languages such as
Hebrew.

BACKGROUND OF THE INVENTION 

With the explosive growth in the volume of available
on-line information, the efficiency of information
retrieval (IR) systems becomes increasingly important.
IR systems generally operate on a canonical
representation of documents, called a "profile,"
consisting of a list of indexing units. For text
searching, the indexing units are typically words. The
profiles are stored in an inverted index, enabling
documents to be retrieved by matching the terms in a
query phrase to the words in the index. Many IR
applications have been developed. One example is the
GURU system described by Maarek et al., in an article
entitled "An Information Retrieval Approach for
Automatically Constructing Software Libraries," in IEEE
Transactions on Software Engineering 17(8), pages 800-813
(August, 1991), which is incorporated herein by
reference.

For efficient and thorough searching, it is
desirable that variants of a given word, such as singular
and plural forms of a noun, or different tenses of a
verb, be mapped to the same indexing unit. In other
words, a lexical analysis of the words should be invoked
so that, ideally, all of them are represented by the same
base word. The simplest tool for lexical analysis is a
stemmer, which derives base words using ad hoc rules for
stripping suffixes and handling exceptional word forms.
A more precise method is morphological analysis, using a
dictionary and a set of declension rules to find the
lexical base forms of the words in the document. The
base form of a given word is referred to as its "lemma."

English language morphology is simple enough that
even stemmers do an adequate job of analysis for most
applications. Hebrew, however, like other Semitic
languages, is highly synthetic and rich in variants. In
standard Hebrew writing, not all of the vowels are
represented, while several letters may represent either a
vowel or a consonant. A given lexical root may be
declined by insertion, deletion, substitution or
affixation of letters. It is often difficult to
determine which letters in a word belong to the lemma,
and which have been added. For example, the Hebrew word
mishtara can be analyzed correctly as any of:

• Mishtara (police)

• Mishtar-ha (her regime)

• Mi-shtar-ha (from her bill)

The result of this complex morphology is a high 
level of ambiguity, which cannot be resolved 
unequivocally without contextual information. Therefore, 
Hebrew morphological analyzers typically return multiple 
possible analyses for a given word. An example of a 
morphological analyzer with Hebrew capabilities is the 
POE LanguageWare system (version 2.6), offered by the IBM 
Software Solutions Division, of Research Triangle Park, 
North Carolina. For each legal Hebrew input string, this 
analyzer returns all legal lexical candidates as possible
analyses of the given string, along with the following 
characteristics of each candidate: 

• Lemma - the base form used for indexing. 

• Category - categorization of the lemma according 
to part of speech, gender, plural inflections, legal 
set of prefixes and legal set of suffixes. 

• Part of speech. 

• Prefix - attached particles. 

• Correct form - (optional) the input word with
additional vowel letters added to enable the given
analysis.

• Number, gender, person.

• Status - (for non-verbs only) - whether this lemma
is in a construct (nismach) or in its absolute
(nifrad) form.

• Tense - (for verbs only).

• Conjugation pattern (binyan and gizra - for verbs
only).

• inf_num, inf_gen, inf_person - number, gender and
person of pronominal suffix, added to Hebrew words
to indicate possessives or verb objects, for
example.

On average, this analyzer returns 2.15 analyses for each
input string.
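
The characteristics listed above can be pictured as one record per candidate analysis. The following is a minimal sketch in Python, not the analyzer's actual output format: the class name `CandidateAnalysis`, the field names, the placeholder `category` strings and the feature values assigned to the three readings of mishtara are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CandidateAnalysis:
    """One candidate analysis of an input string (hypothetical field layout)."""
    lemma: str                          # base form used for indexing
    category: str                       # analyzer-specific lemma category (placeholder)
    part_of_speech: str
    prefix: Optional[str] = None        # attached particles, if any
    correct_form: Optional[str] = None  # input with added vowel letters, if any
    number: Optional[str] = None
    gender: Optional[str] = None
    person: Optional[str] = None
    status: Optional[str] = None        # non-verbs only: "construct" or "absolute"
    tense: Optional[str] = None         # verbs only
    binyan: Optional[str] = None        # verbs only
    gizra: Optional[str] = None         # verbs only
    inf_num: Optional[str] = None       # number of the pronominal suffix
    inf_gen: Optional[str] = None       # gender of the pronominal suffix
    inf_person: Optional[str] = None    # person of the pronominal suffix


# The three readings of "mishtara"; the feature values are illustrative guesses.
candidates = [
    CandidateAnalysis(lemma="mishtara", category="noun", part_of_speech="noun",
                      number="singular", gender="feminine", status="absolute"),
    CandidateAnalysis(lemma="mishtar", category="noun", part_of_speech="noun",
                      number="singular", gender="masculine", status="construct",
                      inf_num="singular", inf_gen="feminine", inf_person="third"),
    CandidateAnalysis(lemma="shtar", category="noun", part_of_speech="noun",
                      prefix="mem", number="singular", gender="masculine",
                      status="construct", inf_num="singular", inf_gen="feminine",
                      inf_person="third"),
]

lemmas = [c.lemma for c in candidates]
```

Each reading yields a different lemma, which is why an index built without disambiguation would have to store all three.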

A number of methods have been proposed for resolving
the ambiguity of Hebrew morphological analysis. Most
methods use contextual information. Levinger et al.
describe a context-free method in an article entitled
"Learning Morpho-Lexical Probabilities from an Untagged
Corpus with an Application to Hebrew," published in
Computational Linguistics 21(3), pages 383-404 (1995),
which is incorporated herein by reference. In studying a
large Hebrew text base, the authors found that 55% of the
words had more than one morphological reading, and 33%
had more than two. They describe a method of
disambiguation based on gathering statistics on the text
base, so as to determine, for each word, a morpho-lexical
probability for each of its alternative analyses,
indicating the likelihood that the analysis is correct.
An analysis of a given word that has a significantly
higher probability than the alternatives is taken to be
the correct one, regardless of the context and the form
of the word.
SUMMARY OF THE INVENTION

In preferred embodiments of the present invention, a
Hebrew morphological disambiguator receives the output of
a morphological analyzer and prunes the number of
candidate analyses for each word. The pruning is based
on the morphological patterns of the different analyses,
rather than on the words themselves as in the
above-mentioned system described by Levinger et al. The
"pattern" of a word, in this context, consists of a
certain combination of linguistic characteristics, which
are typically provided by the morphological analyzer.
Preferably, these characteristics include the part of
speech, prefix, number, gender, person and, in the case
of verbs, the tense and conjugation model. Statistical
data from a large corpus of text are used to determine a
frequency of occurrence of each possible pattern,
independent of the base words, or lemmas, to which the
pattern is applied. The disambiguator prunes out those
candidates whose pattern occurs with low frequency.

Pattern-based disambiguation is advantageous, by
comparison with word-based schemes, because there are far
fewer possible patterns than there are words. As a
result, pattern statistics are more stable and reliable
and easier to handle than word statistics. For example,
in a corpus of 10 million Hebrew words studied by the
inventors, only 2,300 different patterns were found, as
opposed to 25,000 unique words. The methods provided by
preferred embodiments of the present invention thus
enable context-free disambiguation of text with improved
efficiency and confidence by comparison with methods
known in the art. Alternatively or additionally, the
principles of the present invention may be implemented in
conjunction with context-dependent disambiguation
schemes.

In some preferred embodiments of the present
invention, the disambiguator is used as part of a system
for searching a corpus of text documents, such as the
above-mentioned GURU system. Preferably, the
disambiguator is used to prune the number of analyses of
the words in the documents that are included in a search
index of the corpus. It is then used again to analyze
the words in a user query, so as to determine the lemmas
to search for in the index.

Alternatively, the present invention may be used in
other linguistic processing applications, such as
computerized natural language processing and speech
analysis, as well as spell-checking. Dealing with Hebrew
spelling is a particularly difficult problem, since
almost any string can be interpreted as a legal word. In
a preferred embodiment of the present invention, a
spell-checking program uses pattern-based morphological
analysis, as described herein, to identify strings having
rare morphological patterns as potential misspellings.

While preferred embodiments are described herein
with reference to the Hebrew language, the principles of
morphological disambiguation described herein are also
applicable to other morphologically-rich languages,
including particularly other Semitic languages, such as
Arabic.

There is therefore provided, in accordance with a
preferred embodiment of the present invention, a method
for morphological disambiguation, including:

receiving an input string;

morphologically analyzing the string to generate a
list of candidate analyses of the string, each candidate
analysis including a respective word and a linguistic
pattern of the word; and

evaluating the pattern of each of the analyses
against a predefined criterion in order to select one or
more of the analyses from the list.

Preferably, receiving the input string includes
receiving a word in a Semitic language, most preferably
in Hebrew.

Further preferably, the linguistic pattern includes
a specification of at least one characteristic of the
word, selected from a set of characteristics including a
part of speech, prefix, number, gender and person of the
word. Most preferably, the specification of the at least
one characteristic includes a specification of all of the
characteristics in the set. Additionally or
alternatively, when the base word includes a verb, the
linguistic pattern further includes a designation of a
tense and conjugation pattern of the verb.

In a preferred embodiment, each of the analyses has
a lemma and a paradigm determined by the word and the
linguistic pattern thereof, and evaluating the pattern
includes eliminating one of the analyses from the list if
it has the same lemma and paradigm as another of the
analyses.

Preferably, evaluating the pattern includes
determining a relative frequency of occurrence of the
pattern of each of the analyses, and selecting the at
least one of the analyses whose frequency of occurrence
is above a predetermined threshold. Most preferably,
determining the relative frequency of occurrence includes
morphologically analyzing a corpus of text and finding
the frequency of occurrence of the pattern in the corpus,
wherein determining the relative frequency of occurrence
includes storing in a table the frequency of occurrence
found in the corpus, and looking up the pattern in the
table. Additionally or alternatively, selecting the at
least one of the analyses includes setting the threshold
so as to control how many of the analyses from the list
are selected. Further additionally or alternatively,
selecting the at least one of the analyses includes
selecting the at least one of the analyses based on the
pattern thereof, and substantially independently of the
respective word.

In a preferred embodiment, the method includes
searching in a corpus of text for a match to the input
string using the one or more selected analyses. In
another preferred embodiment, the method includes
checking a document for spelling errors using the one or
more selected analyses.

There is also provided, in accordance with a
preferred embodiment of the present invention, a method
for searching a corpus of text made up of words,
including:

morphologically analyzing the words in the corpus to
generate, for each of at least some of the words, a list
of candidate analyses, each candidate analysis including
a respective lemma and a linguistic pattern relating the
lemma to the analyzed word;

evaluating the pattern of each of the analyses
against a predefined criterion in order to select one or
more of the analyses from the list for each of the
analyzed words;

entering the lemmas of the selected analyses in an
index of the corpus; and

applying a search query to the index.

Preferably, applying the search query includes:

receiving an input text string;

morphologically analyzing and disambiguating the
string to generate one or more search lemmas for the
string; and

comparing the search lemmas to the index.

There is further provided, in accordance with a
preferred embodiment of the present invention, a computer
software product, including a computer-readable medium in
which program instructions are stored, which
instructions, when read by a computer, cause the computer
to morphologically analyze an input string to generate a
list of candidate analyses of the string, each candidate
analysis including a respective word and a linguistic
pattern of the word, and to evaluate the pattern of each
of the analyses against a predefined criterion in order
to select one or more of the analyses from the list.

There is additionally provided, in accordance with a
preferred embodiment of the present invention, a computer
software product, including a computer-readable medium in
which program instructions are stored, which
instructions, when read by a computer, cause the computer
to morphologically analyze the words in the corpus to
generate, for each of at least some of the words, a list
of candidate analyses, each candidate analysis including
a respective lemma and a linguistic pattern relating the
lemma to the analyzed word, to evaluate the pattern of
each of the analyses against a predefined criterion in
order to select one or more of the analyses from the list
for each of the analyzed words, to enter the lemmas of
the selected analyses in an index of the corpus, and to
apply a search query to the index.

There is furthermore provided, in accordance with a
preferred embodiment of the present invention, apparatus
for morphological disambiguation, including a linguistic
processor, which is adapted to receive an input string,
to morphologically analyze the string to generate a list
of candidate analyses of the string, each candidate
analysis including a respective word and a linguistic
pattern of the word, and to evaluate the pattern of each
of the analyses against a predefined criterion in order
to select one or more of the analyses from the list.

There is moreover provided, in accordance with a
preferred embodiment of the present invention, apparatus
for searching a corpus of text made up of words,
including a linguistic processor, which is adapted to
morphologically analyze the words in the corpus to
generate, for each of at least some of the words, a list
of candidate analyses, each candidate analysis including
a respective lemma and a linguistic pattern relating the
lemma to the analyzed word, to evaluate the pattern of
each of the analyses against a predefined criterion in
order to select one or more of the analyses from the list
for each of the analyzed words, to enter the lemmas of
the selected analyses in an index of the corpus, and to
apply a search query to the index.

The present invention will be more fully understood
from the following detailed description of the preferred
embodiments thereof, taken together with the drawings in
which:
BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a schematic, pictorial illustration of a 
system for linguistic analysis with morphological 
disambiguation, in accordance with a preferred embodiment 
of the present invention; 

Fig. 2 is a block diagram showing functional details 
of the system of Fig. 1, in accordance with a preferred 
embodiment of the present invention; 

Fig. 3 is a flow chart that schematically 
illustrates a method for gathering pattern statistics, in 
accordance with a preferred embodiment of the present 
invention; 

Fig. 4 is a flow chart that schematically 
illustrates a method for morphological disambiguation, in 
accordance with a preferred embodiment of the present 
invention; and 

Figs. 5 and 6 are graphic plots illustrating results 
obtained from morphological disambiguation of words in a 
corpus of text using the method of Fig. 4. 
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

Fig. 1 is a schematic, pictorial illustration of a
system 20 for linguistic processing, in accordance with a
preferred embodiment of the present invention. System 20
comprises a computer processor 22, with a text input
device, such as a keyboard 24, and an output device, such
as a display 26. Alternatively or additionally, the
processor may receive its input and/or output via a
network or by any other suitable means. System 20
typically performs its functions, described in detail
hereinbelow, under the control of software running on
processor 22. This software may be downloaded to the
processor over a network or, alternatively, it may be
provided on tangible means, such as CD-ROM or
non-volatile memory.

In typical operation, system 20 operates on a corpus
of text documents, which are stored in one or more
storage devices 28, either local to processor 22 or
accessed via a network. System 20 processes the
documents to determine the lemmas of the words in the
text and, preferably, to build an index to the corpus
based on these lemmas. A user of system 20 inputs a
search string 30, such as the Hebrew word "mishtara"
mentioned in the Background of the Invention. The system
finds the lemma (or multiple candidate lemmas) of the
search string and uses it to retrieve matching documents
from the corpus, based on the index. One such match 32,
"hamishtartiyim" (a plural, adjectival form of "police,"
prefixed by the definite article) is shown on display 26
by way of example. While string matching and other naive
algorithms would fail to find this match, morphological
processing based on the methods described herein enables
matches like this one to be found with good precision.

Fig. 2 is a block diagram that schematically
illustrates a set 40 of functional blocks used in
processing performed by system 20, in accordance with a
preferred embodiment of the present invention. In
practice, the functions of all of these blocks are
preferably carried out by the software running on
processor 22, although some of the processing functions
may also be performed by a remote server, for example.

Each Hebrew word to be processed is input to a
morphological analyzer 42. Preferably, analyzer 42
comprises the POE system mentioned in the Background of
the Invention, although substantially any Hebrew language
morphological analyzer known in the art may be used. For
each input word, analyzer 42 typically generates multiple
candidate analyses, each comprising a lemma and
linguistic characteristics relating the lemma to the
input word, as described above.

The output of analyzer 42 is processed by a filter
44, in order to remove variant analyses that are not
considered relevant for the purpose of indexing.
Preferably, the filter removes corrected forms of words,
i.e., analyses that the morphological analyzer has
inferred by adding optional vowel letters that are absent
in the original input string. This rule is motivated by
the assumption that generally only the original string is
a candidate to be indexed (or to be searched).

Additionally or alternatively, the filter eliminates
multiple analyses having the same lemma and paradigm,
leaving only one representative base form for each such
set. The "paradigm" of a word in this context is
preferably taken to be its part of speech (noun, verb,
etc.), with the addition of its conjugation pattern
(binyan) in the case of verbs. The reason for this rule
is that different inflections of the same lemma do not
add information that should be stored in the index. For
example, the words inyani (my interest), inyanay (my
interests) and inyanei (the interests of) are all
constructs of the same lemma and paradigm: inyan -
interest (noun). These three variants are typically
spelled identically in Hebrew. Filter 44 removes two of
the variants.
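
The two filtering rules can be sketched as follows. This is a minimal illustration under stated assumptions, not the filter's actual implementation: the `Analysis` record, its field names and the convention that a non-empty `correct_form` marks a corrected form are all hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Analysis(NamedTuple):
    """Hypothetical, reduced view of one candidate analysis."""
    lemma: str
    part_of_speech: str
    binyan: Optional[str] = None        # conjugation pattern, verbs only
    correct_form: Optional[str] = None  # set when vowel letters were added
    gloss: str = ""


def paradigm(a: Analysis) -> Tuple[str, Optional[str]]:
    """Part of speech, plus the binyan in the case of verbs."""
    return (a.part_of_speech, a.binyan if a.part_of_speech == "verb" else None)


def filter_analyses(analyses: List[Analysis]) -> List[Analysis]:
    """Drop corrected forms, then keep one analysis per (lemma, paradigm)."""
    seen = set()
    kept = []
    for a in analyses:
        if a.correct_form is not None:   # rule 1: corrected form
            continue
        key = (a.lemma, paradigm(a))
        if key in seen:                  # rule 2: duplicate lemma and paradigm
            continue
        seen.add(key)
        kept.append(a)
    return kept


# The three identically spelled variants from the text share lemma and paradigm.
variants = [
    Analysis("inyan", "noun", gloss="my interest"),
    Analysis("inyan", "noun", gloss="my interests"),
    Analysis("inyan", "noun", gloss="the interests of"),
]
filtered = filter_analyses(variants)
```

In this sketch the first analysis encountered is kept as the representative; the text does not say which of the duplicates survives, so that choice is arbitrary.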

The application of these two filtering rules
together was found to reduce the average number of
analyses per word from 2.15 to 1.91. Alternatively,
other filtering algorithms, or no filtering, may be used.

The filtered list of analyses is input to a
statistical disambiguator 46. The disambiguator decides
which of the candidate analyses are likely to be correct
based on a statistical base 48 of morphological patterns.
The morphological pattern of a given analysis is
preferably defined as a tuple of the values of the
following characteristics:
TABLE I - PATTERN CHARACTERISTICS

Field            No. of values   Values
---------------  -------------   ------------------------------------------
Part of speech   12              Noun, verb, adjective, number, pronoun,
                                 preposition, conjunction, interrogative,
                                 particle, adverb, abbreviation, proper
                                 name
Prefix           8               Letters mem, shin, heh, vav, kaf, lamed,
                                 bet, or none. (For combination prefixes,
                                 only the last letter before the lemma is
                                 used.)
Number           2               Singular, plural
Gender           3               Masculine, feminine, or both
Person           4               First, second, third or all
Tense            5               Past, present, future, imperative,
                                 infinitive
Conjugation      7               Paal, nifal, piel, pual, hifil, hufal,
                                 hitpael (standard Hebrew conjugation
                                 patterns)
Status           2               Construct, absolute
Pronoun suffix   11              Legal combinations of inf_num, inf_gen,
                                 inf_person for constructs; null for
                                 absolute forms.

Tense and conjugation apply only to verbs, while status
applies only to non-verbs. This combination of
characteristics was found to be convenient and useful in
analyzing Hebrew morphology. It will be understood,
however, that other combinations and sub-combinations of
characteristics may also be used, including properties
not listed in the table above. Those skilled in the art
will recognize appropriate characteristics to use in
defining morphological patterns for languages other than
Hebrew.
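
A pattern tuple built from the fields of Table I might look like the sketch below. The `Pattern` type, the dictionary keys and the example feature values are assumptions for illustration; the point is only that the lemma is left out, so that different words can share one pattern.

```python
from typing import NamedTuple, Optional, Tuple


class Pattern(NamedTuple):
    """Pattern tuple following the fields of Table I (hypothetical encoding)."""
    part_of_speech: str
    prefix: Optional[str]
    number: Optional[str]
    gender: Optional[str]
    person: Optional[str]
    tense: Optional[str]          # verbs only
    conjugation: Optional[str]    # verbs only
    status: Optional[str]         # non-verbs only
    pronoun_suffix: Optional[Tuple[Optional[str], Optional[str], Optional[str]]]


def pattern_of(analysis: dict) -> Pattern:
    """Build the pattern of an analysis; the lemma is deliberately ignored."""
    is_verb = analysis["part_of_speech"] == "verb"
    suffix = None
    if any(analysis.get(k) for k in ("inf_num", "inf_gen", "inf_person")):
        suffix = (analysis.get("inf_num"), analysis.get("inf_gen"),
                  analysis.get("inf_person"))
    return Pattern(
        part_of_speech=analysis["part_of_speech"],
        prefix=analysis.get("prefix"),
        number=analysis.get("number"),
        gender=analysis.get("gender"),
        person=analysis.get("person"),
        tense=analysis.get("tense") if is_verb else None,
        conjugation=analysis.get("conjugation") if is_verb else None,
        status=analysis.get("status") if not is_verb else None,
        pronoun_suffix=suffix,
    )


# Two different lemmas with the same (illustrative) features share one pattern.
a = {"lemma": "mishtara", "part_of_speech": "noun", "number": "singular",
     "gender": "feminine", "status": "absolute"}
b = {"lemma": "medina", "part_of_speech": "noun", "number": "singular",
     "gender": "feminine", "status": "absolute"}
same_pattern = pattern_of(a) == pattern_of(b)
```

Because the tuple is hashable, it can serve directly as a key in a frequency table.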

Fig. 3 is a flow chart that schematically 
illustrates a method for building statistical base 48, in 
accordance with a preferred embodiment of the present 
invention. This method generates, for each pattern 
tuple, a frequency of occurrence that indicates, for the 
purposes of disambiguator 46, a likelihood that an 
analysis having this pattern is the correct one. The 
frequency is independent of the lemma to which the 
pattern is applied. At a corpus input step 50, a sample 
corpus of text is received for processing. Analyzer 42 
is used to find pattern tuples of the words in the 
sample, at a pattern finding step 52. The inventors used 
a corpus of 10 million Hebrew words, among which the 
analyzer found 2,300 different patterns (as opposed to 
25,000 different lemmas). 

In order to generate frequency statistics, ambiguous 
words, for which the analyzer returned multiple analyses, 
are preferably removed from the sample, at an ambiguity 
elimination step 54. This step reduced the initial 10 
million words in the inventors' corpus to about 4.5 
million words. At a counting step 56, a counter is 
incremented for each instance of each legal pattern that 
is encountered among the remaining, unambiguous words. 
The final count values are preferably hashed, for 
efficient retrieval, and are stored in a global pattern 
table in base 48, at a storage step 58. 
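
Steps 54 through 58 amount to counting patterns over the unambiguous words only. A minimal sketch, assuming the analyzed corpus is available as one list of pattern keys per word token (the function name and input layout are hypothetical, and words with no legal analysis are skipped along with the ambiguous ones):

```python
from collections import Counter
from typing import Hashable, Iterable, List


def build_pattern_table(analyzed_words: Iterable[List[Hashable]]) -> Counter:
    """Count pattern occurrences over unambiguous words only.

    analyzed_words yields, for each word token in the sample corpus, the list
    of patterns of its candidate analyses.
    """
    table: Counter = Counter()
    for patterns in analyzed_words:
        if len(patterns) != 1:       # ambiguous (or unanalyzed) word: skip
            continue
        table[patterns[0]] += 1      # one more instance of this pattern
    return table


# Tiny illustrative sample; the pattern keys are placeholders.
sample = [["P1"], ["P1", "P2"], ["P2"], ["P1"], []]
pattern_table = build_pattern_table(sample)
```

A `Counter` is a hash table keyed by pattern, which matches the hashed global table described above in spirit, though the original storage format is not specified.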

Fig. 4 is a flow chart that schematically
illustrates a method for analyzing and disambiguating an
input word in system 20, in accordance with a preferred
embodiment of the present invention. At a morphology
step 60, analyzer 42 generates a morphological analysis
of the word, typically including multiple candidate
analyses. Filter 44 operates on the analyses to remove
corrected forms, at a first filtering step 62, and to
eliminate duplicate analyses with the same lemma and
paradigm, at a second filtering step 64. Steps 60, 62
and 64 were described in detail hereinabove.

At a decision step 65, disambiguator 46 determines
how the candidate analyses are to be handled, depending
upon the number of analyses delivered by filter 44. If
no legal analysis was found by analyzer 42, the
disambiguator simply returns the base string that was
input to system 20, at a base return step 66. If the
filter delivered a single legal analysis, the
disambiguator returns the lemma of this analysis, at a
lemma return step 68. On the other hand, if multiple
candidate analyses were found, the disambiguator finds
the pattern tuple for each analysis, at a pattern finding
step 70. It looks up the tuples in the pattern table of
statistical base 48 to find their respective frequencies,
at a lookup step 72. A relative frequency is calculated for
each of the candidate patterns, at a relative frequency
determination step 74. The relative frequency for each
pattern is preferably given by the frequency of that
pattern, as listed in the global table, divided by the
sum of the frequencies of all of the patterns that were
found for the current input word.
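
The relative frequency of step 74 can be written compactly as follows. The symbols are notation introduced here for illustration and do not appear in the original text: f(p) is the count of pattern p in the global table, p_1, ..., p_n are the patterns of the n candidate analyses of the current word, and r_i is the relative frequency of the i-th candidate. The formula assumes that at least one of the candidate patterns has a nonzero count.

```latex
r_i = \frac{f(p_i)}{\sum_{j=1}^{n} f(p_j)}, \qquad i = 1, \dots, n,
\qquad \text{so that} \qquad \sum_{i=1}^{n} r_i = 1 .
```

As a worked example with hypothetical counts, three candidates whose patterns occur 900, 90 and 10 times in the table receive relative frequencies of 0.90, 0.09 and 0.01.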

At a sorting step 76, the relative frequencies are
compared to a threshold parameter ε. The choice of the
value of ε depends on how drastically the list of
candidate analyses is to be pruned. Analyses with
relative frequencies below the threshold are rejected, at
a rejection step 78. The lemmas of all analyses having
frequencies above the threshold are returned at step 68.
These lemmas are typically used in building a search
index for documents in a corpus or for searching the
index thereafter, based on a given query word or words.
When multiple lemmas are returned by step 76, their
relative frequencies are preferably returned, as well,
for use in the search application. Since the relevance
score of a document retrieved in a search typically
depends on the frequency of occurrence of the query terms
inside the document, and some of the terms will have
multiple lemmas, the search would be biased in favor of
ambiguous terms if all of the lemmas were allowed to
contribute equally to the score. Therefore, the relative
frequencies of the lemmas are preferably used as a
weighting factor in computing the relevance scores.
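
The decision, lookup, thresholding and weighting steps can be sketched together as follows. This is an illustration under stated assumptions rather than the actual implementation: the function names, the (lemma, pattern) input layout, the counts in the example, the handling of patterns absent from the table, the choice to keep analyses exactly at the threshold, and the particular weighted-count formula are all hypothetical.

```python
from typing import Dict, Hashable, List, Tuple


def disambiguate(word: str,
                 candidates: List[Tuple[str, Hashable]],
                 pattern_table: Dict[Hashable, int],
                 epsilon: float) -> List[Tuple[str, float]]:
    """Return (lemma, relative frequency) pairs that survive the threshold.

    candidates holds one (lemma, pattern) pair per filtered analysis.
    """
    if not candidates:                       # no legal analysis: base string
        return [(word, 1.0)]
    if len(candidates) == 1:                 # single analysis: its lemma
        return [(candidates[0][0], 1.0)]
    freqs = [pattern_table.get(pattern, 0) for _, pattern in candidates]
    total = sum(freqs)
    if total == 0:
        # Assumption: with no statistics at all, keep every candidate equally.
        share = 1.0 / len(candidates)
        return [(lemma, share) for lemma, _ in candidates]
    selected = []
    for (lemma, _), freq in zip(candidates, freqs):
        relative = freq / total
        if relative >= epsilon:              # below the threshold: rejected
            selected.append((lemma, relative))
    return selected


def weighted_term_frequency(selected: List[Tuple[str, float]],
                            lemma_counts: Dict[str, int]) -> float:
    """One possible use of the weights: scale each lemma's count in a document."""
    return sum(weight * lemma_counts.get(lemma, 0) for lemma, weight in selected)


# Hypothetical pattern counts for the three readings of "mishtara".
table = {"noun-absolute": 900, "noun-suffix": 90, "prefix-noun-suffix": 10}
cands = [("mishtara", "noun-absolute"), ("mishtar", "noun-suffix"),
         ("shtar", "prefix-noun-suffix")]
result = disambiguate("mishtara", cands, table, epsilon=0.05)
score = weighted_term_frequency(result, {"mishtara": 2, "mishtar": 1})
```

With these made-up counts the third reading falls below the threshold and is pruned, while the two surviving lemmas contribute to the score in proportion to their relative frequencies rather than equally.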

Fig. 5 is a plot showing the number of analyses per
word passed by disambiguator 46 as a function of the
chosen threshold parameter ε, in a sample of 16,000 words
tested by the inventors. A curve 82 shows the percentage
of words out of the total sample for which the
disambiguator returned a single analysis. Curves 84, 86
and 88 show the percentages for returning two, three, or
four or more analyses, respectively. At ε = 0.1, for
example, more than 75% of the input words receive only a
single analysis, and fewer than 5% of the words have more
than two analyses. For ε = 0.5, the disambiguator returns
only the most likely analysis, and prunes out all of the
rest.

Fig. 6 is a plot showing the accuracy of
disambiguator 46 in returning the correct analysis of the
input words in the sample. "Accuracy" is defined here as
the probability that one of the analyses returned by the
disambiguator is the "true" analysis, as chosen manually
by a human reviewer, without regard to the number of
"false" analyses that are returned at the same time. A
curve 92 shows the accuracy of disambiguation over the
entire sample, while a lower curve 94 shows the accuracy
only with respect to ambiguous words (for which analyzer
42 returns two or more analyses). At low thresholds, the
disambiguator prunes out relatively few of the analyses,
so that the accuracy is close to 100%. (It is not
exactly 100%, because filter 44 occasionally removes the
correct lemma.) For large values of ε, the accuracy
drops. It will be observed, however, that for ε = 0.1,
the accuracy of the disambiguator is maintained at 95%,
while only one or two analyses are returned for 95% of
the words in the sample, as mentioned above.
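
The accuracy measure defined above can be computed as in the sketch below. The function name and the input layout (one set of returned analyses per word, paired with the manually chosen analysis) are assumptions for illustration, and the sample data are invented.

```python
from typing import Hashable, Sequence, Set


def accuracy(returned: Sequence[Set[Hashable]],
             truth: Sequence[Hashable]) -> float:
    """Fraction of words whose manually chosen analysis is among those returned.

    The number of additional, false analyses returned for a word is ignored.
    """
    if len(returned) != len(truth):
        raise ValueError("returned and truth must have the same length")
    if not truth:
        raise ValueError("at least one word is required")
    hits = sum(1 for kept, true_analysis in zip(returned, truth)
               if true_analysis in kept)
    return hits / len(truth)


# Three words: the true analysis survives for the first and the third.
acc = accuracy([{"a", "b"}, {"c"}, {"d"}], ["a", "x", "d"])
```

Restricting both sequences to the ambiguous words before calling the function gives the quantity plotted by the lower curve.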

Thus, by judicious choice of the threshold
parameter, a search index can be built and search queries
analyzed with enhanced precision, relative to methods
known in the art. "Precision" in this context refers to
the proportion of relevant items out of the total number
of items that are retrieved in the search. The cost of
this precision is a reduced level of "recall," meaning
that relevant items will sometimes be missed, because the
disambiguator has pruned out the "true" analysis of a
term. Therefore, the threshold ε is preferably chosen to
give an optimal tradeoff between search efficiency and
thoroughness.

Alternatively, system 20 and the methods described
hereinabove may be integrated in other linguistic
processing applications, such as computerized natural
language processing. Furthermore, although system 20 is
designed to operate on Hebrew language texts, the
principles of morphological disambiguation described
herein are also applicable to other morphologically-rich
languages, including particularly other Semitic
languages, such as Arabic.

It will thus be appreciated that the preferred 
embodiments described above are cited by way of example, 
and that the present invention is not limited to what has 
been particularly shown and described hereinabove. 
Rather, the scope of the present invention includes both 
combinations and subcombinations of the various features 
described hereinabove, as well as variations and 
modifications thereof which would occur to persons 
skilled in the art upon reading the foregoing description 
and which are not disclosed in the prior art. 